Skip to content

[benchmarks] Reworked the conversion benchmark and added more tests for up/down casts #4800

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

AndreyPavlenko
Copy link
Contributor

@AndreyPavlenko AndreyPavlenko commented Jul 29, 2025

Closes #4799

Here is an example of the benchmark report:

float-upcast:
            N  fp8e4nv->fp16  fp8e4nv->bf16  fp8e4nv->fp32  fp8e5->fp16  fp8e5->bf16  fp8e5->fp32  fp16->fp32  bf16->fp32
0      4096.0       0.010524       0.010252       0.010187     0.080630     0.010488     0.043948    0.139510    0.193939
1     16384.0       0.042079       0.041001       0.040740     0.322013     0.041959     0.175794    0.556522    0.771375
2     65536.0       0.168248       0.163971       0.162928     1.286028     0.167731     0.701971    2.223066    3.079699
3    262144.0       0.507677       0.501730       0.493866     5.088199     0.503581     2.786395    8.820458   12.158813
4   1048576.0       1.202495       1.163017       1.182960    12.700775     1.169868     8.868200   30.411137   26.532794
5   4194304.0       1.623786       1.587861       1.594501    21.531335     1.587212    15.298745   50.582537   42.538580
6  16777216.0       1.961367       1.927211       1.924611    33.078107     1.923428    23.340590   83.022644   69.788752
7  67108864.0       2.121850       2.079672       2.081887    35.847220     2.076763    25.215246   95.001223   78.473379
float-downcast:
            N  fp16->fp8e4nv/rtne  fp16->fp8e5/rtne  bf16->fp8e4nv/rtne  bf16->fp8e5/rtne  fp32->fp8e4nv/rtne  fp32->fp8e4nv/rtz  fp32->fp8e5/rtne  fp32->fp8e5/rtz  fp32->fp16/rtne  fp32->fp16/rtz  fp32->bf16/rtne  fp32->bf16/rtz
0      4096.0            0.023953          0.071309            0.020427          0.020877            0.028330           0.202172          0.051561         0.304762         0.436674        0.508189         0.432981        0.135899
1     16384.0            0.095768          0.285237            0.081626          0.083490            0.113290           0.807094          0.206036         1.213630         1.731924        2.017734         1.724632        0.542517
2     65536.0            0.382893          1.138173            0.326310          0.333618            0.452534           3.222026          0.823523         4.825920         6.898526        8.031373         6.855230        2.167196
3    262144.0            1.490979          3.903276            1.226577          1.254518            1.786452          11.270163          3.193762        17.558205        26.969547       29.721542        26.586613        8.594885
4   1048576.0            4.111418          5.797081            2.281100          2.298904            6.448014          29.355431         10.123344        53.335504        60.401843       72.067079        57.174264       18.035363
5   4194304.0            7.612720         12.685410            4.564980          4.603662           12.267993          45.939803         18.057103        85.667974       103.768036      114.161786        97.000555       28.142136
6  16777216.0           10.127255         14.296733            5.533675          5.576271           18.747797          55.520604         26.028944       111.877941       179.781569      162.617195       168.074694       46.045713
7  67108864.0           10.861711         15.523535            5.929267          5.961534           18.644044          57.701748         27.834568       117.730718       204.201753      181.876698       191.695795       49.386877

And the benchmark run results - https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/16596562534/job/46944550498

@AndreyPavlenko AndreyPavlenko force-pushed the AndreyPavlenko/bench-fp8 branch 2 times, most recently from b5381cf to 73136d9 Compare July 29, 2025 13:08
y = tl.cast(y, y_type, bitcast=True)
tl.store(y_ptr + offsets, y, mask=mask)

kernel.__name__ = kernel.__qualname__ = name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why we want to change the kernel name here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes much easier finding and comparing the kernel IRs in the cache. For example, I want to compare the rtne and rtz IRs:

$ find ~/.triton/cache/ -name '*fp32_to_fp16*.llir'
/home/jovyan/.triton/cache/PXX2VZY5SPACCMUKHEIEDQEEB2CN2A562JHDMIVPLHTS6F6LSRAA/rtne_fp32_to_fp16_conversion_kernel.llir
/home/jovyan/.triton/cache/CYCEIIP4OQCRWTMZHZHLPXO63ZR5J7ZEPN4KCM72NVOITFO5FMRA/rtz_fp32_to_fp16_conversion_kernel.llir

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can keep this in develop stage, since from the production perspective, there is not performance or visualization improvement from this change. But it breaks the code structure to be incosistant with other benchmarks, which reduces code readability.
Without this change, we also do not need lru_cache, _make_kernel and could keep using triton.jit as decorator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very convenient using benchmarks for development and testing. This bench produces 20 kernels and it's not easy to recognize the required kernel in the cache.

tl.store(y_ptr + offsets, y, mask=mask)

kernel.__name__ = kernel.__qualname__ = name
return triton.jit(kernel)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest using triton.jit in decorator as other benchmarks do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decorator does not allow changing the name.

if r == 'rtne' or s == tl.float32]


@lru_cache
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Triton has its cache mechanism. You can check the kernel jit cache under TRITON_CACHE_DIR if specified. We do not need to use lru_cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but JITFunction object is created on each call to _make_kernel(). I'm not sure if it has significant impact on the bench results.

return gbps(ms), gbps(max_ms), gbps(min_ms)


def _report(plot_name, line_names, line_vals):
report = triton.testing.perf_report(
triton.testing.Benchmark(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's recommended to use triton_kernels_benchmark instead of triton.testing.Benchmark to get the accurate kernel time cost.
It's better to use perf_report and benchmark in python decorator form to keep consistent with other benchmarks.
If you want to add more arguments to the benchmarks, you can refer to

def get_benchmark(
providers_filter: Optional[list[str]] = None,
transpose_a=False,
transpose_b=False,
matmul_kernel=matmul_kernel_with_block_pointers,
matmul_kernel_batched=matmul_kernel_with_block_pointers_batched,
plot_name='matmul-performance',
):
"""
Returns a Mark object containing a Benchmark object constructed at runtime and parameterized by the provided option values.
The benchmark can then be executed by calling the :code:`.run` method on the return value.
"""
supported_providers = {
'triton': 'Triton',
'onednn': 'OneDNN',
}
# use_cutlass
if not (transpose_a or transpose_b):
if torch.xpu.get_device_name() != 'Intel(R) Arc(TM) Graphics':
# FIXME: enable cutlass on LNL
supported_providers['cutlass'] = 'CUTLASS'
providers = benchmark_suite.filter_providers(supported_providers, providers_filter)
# Benchmark Performance
# pylint: disable=too-many-branches
@benchmark_suite.perf_report(
benchmark_suite.Benchmark(
# argument names to use as an x-axis for the plot
x_names=['B', 'M', 'N', 'K'],
# different possible values for `x_name`
x_vals=X_VALS,
line_arg='provider',
# argument name whose value corresponds to a different line in the plot
# possible values for `line_arg``
line_vals=list(providers.keys()),
# label name for the lines
line_names=list(providers.values()),
# line styles
styles=[('green', '-'), ('green', '--'), ('blue', '-'), ('blue', '--')],
ylabel=['GB/s', 'TFlops'], # label name for the y-axis
plot_name=plot_name,
# name for the plot. Used also as a file name for saving the plot.
args={},
))
def benchmark(B, M, N, K, provider):

Copy link
Contributor Author

@AndreyPavlenko AndreyPavlenko Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to triton_kernels_benchmark, but now there are much more results. I've grouped them in 8 benchmarks, one for each pair of types:

fp8e4nv-fp16:
            N  fp8e4nv->fp16-GB/s  fp16->fp8e4nv-rtne-GB/s  fp8e4nv->fp16-GB/s-min  fp16->fp8e4nv-rtne-GB/s-min  fp8e4nv->fp16-GB/s-max  fp16->fp8e4nv-rtne-GB/s-max  fp8e4nv->fp16-CV  fp16->fp8e4nv-rtne-CV
0      4096.0            0.009099                 0.024323                0.007853                     0.024311                0.010699                     0.024335          0.119163               0.000359
1     16384.0            0.042830                 0.097263                0.042824                     0.097246                0.042845                     0.097292          0.000268               0.000246
2     65536.0            0.171211                 0.388892                0.171152                     0.388799                0.171237                     0.388984          0.000311               0.000254
3    262144.0            0.514351                 1.515020                0.514218                     1.512835                0.514573                     1.517318          0.000333               0.001386
4   1048576.0            1.211834                 4.166472                1.206658                     4.166042                1.215926                     4.167631          0.005078               0.000204
5   4194304.0            1.627807                 7.824465                1.627375                     7.666599                1.628487                     7.990247          0.000360               0.021691
6  16777216.0            1.961954                10.152995                1.960651                    10.151963                1.963331                    10.153733          0.000892               0.000100
7  67108864.0            2.121951                10.910362                2.121518                    10.909286                2.122361                    10.911670          0.000233               0.000094
fp8e4nv-bf16:
            N  fp8e4nv->bf16-GB/s  bf16->fp8e4nv-rtne-GB/s  fp8e4nv->bf16-GB/s-min  bf16->fp8e4nv-rtne-GB/s-min  fp8e4nv->bf16-GB/s-max  bf16->fp8e4nv-rtne-GB/s-max  fp8e4nv->bf16-CV  bf16->fp8e4nv-rtne-CV
0      4096.0            0.010431                 0.020689                0.010428                     0.020679                0.010432                     0.020704          0.000189               0.000471
1     16384.0            0.041715                 0.082722                0.041694                     0.082707                0.041728                     0.082747          0.000361               0.000286
2     65536.0            0.166801                 0.330556                0.166707                     0.330429                0.166843                     0.330723          0.000373               0.000418
3    262144.0            0.508169                 1.243508                0.507339                     1.241306                0.508504                     1.246121          0.001142               0.001853
4   1048576.0            1.170913                 2.298198                1.170286                     2.297332                1.171415                     2.299105          0.000427               0.000372
5   4194304.0            1.590390                 4.414360                1.588078                     4.253789                1.592187                     4.589115          0.001110               0.040270
6  16777216.0            1.927215                 5.541919                1.925484                     5.541688                1.928841                     5.542215          0.000827               0.000049
7  67108864.0            2.078866                 5.955846                2.078395                     5.955534                2.079470                     5.956143          0.000249               0.000041
fp8e4nv-fp32:
            N  fp8e4nv->fp32-GB/s  fp32->fp8e4nv-rtne-GB/s  fp32->fp8e4nv-rtz-GB/s  fp8e4nv->fp32-GB/s-min  fp32->fp8e4nv-rtne-GB/s-min  fp32->fp8e4nv-rtz-GB/s-min  fp8e4nv->fp32-GB/s-max  fp32->fp8e4nv-rtne-GB/s-max  fp32->fp8e4nv-rtz-GB/s-max  fp8e4nv->fp32-CV  fp32->fp8e4nv-rtne-CV  fp32->fp8e4nv-rtz-CV
0      4096.0            0.010362                 0.028596                0.220511                0.010355                     0.028587                    0.220215                0.010369                     0.028603                    0.220785          0.000572               0.000277              0.001602
1     16384.0            0.041439                 0.114354                0.879678                0.041421                     0.114350                    0.878593                0.041457                     0.114356                    0.880860          0.000401               0.000099              0.001598
2     65536.0            0.165713                 0.456808                3.507412                0.165682                     0.456735                    3.506849                0.165763                     0.456913                    3.508351          0.000286               0.000255              0.000757
3    262144.0            0.500332                 1.804281               12.206938                0.499604                     1.803710                   12.181413                0.500887                     1.805399                   12.231430          0.001165               0.000460              0.001846
4   1048576.0            1.193028                 6.527287               31.324152                1.192344                     6.523916                   31.237370                1.194213                     6.531394                   31.402012          0.001297               0.000556              0.002341
5   4194304.0            1.599378                12.358361               46.772281                1.598380                    12.115543                   46.425925                1.600313                    12.566826                   47.254437          0.000648               0.014158              0.025110
6  16777216.0            1.926104                18.816976               55.924057                1.925887                    18.813263                   55.880840                1.926340                    18.820691                   55.949418          0.000105               0.000181              0.000633
7  67108864.0            2.082396                18.749761               57.819820                2.082237                    18.684158                   57.809411                2.082518                    18.789707                   57.828937          0.000069               0.002619              0.000188
fp8e5-fp16:
            N  fp8e5->fp16-GB/s  fp16->fp8e5-rtne-GB/s  fp8e5->fp16-GB/s-min  fp16->fp8e5-rtne-GB/s-min  fp8e5->fp16-GB/s-max  fp16->fp8e5-rtne-GB/s-max  fp8e5->fp16-CV  fp16->fp8e5-rtne-CV
0      4096.0          0.090101               0.073537              0.089762                   0.073458              0.090459                   0.073584        0.003488             0.001016
1     16384.0          0.360405               0.293673              0.359298                   0.293326              0.360817                   0.293915        0.002256             0.001314
2     65536.0          1.435304               1.172799              1.434174                   1.171625              1.437193                   1.173639        0.003717             0.001082
3    262144.0          5.674113               4.018765              5.645762                   4.015686              5.688889                   4.020613        0.004035             0.000635
4   1048576.0         12.052598               5.826393             12.042631                   5.817665             12.069245                   5.833200        0.001300             0.001061
5   4194304.0         20.661597              11.960147             19.493159                  11.722220             21.315986                  12.711861        0.058014             0.036636
6  16777216.0         29.556085              14.154166             29.527371                  14.151587             29.590698                  14.155981        0.000950             0.000153
7  67108864.0         31.602352              15.201539             31.593605                  15.141837             31.612656                  15.238494        0.000372             0.003269
fp8e5-bf16:
            N  fp8e5->bf16-GB/s  bf16->fp8e5-rtne-GB/s  fp8e5->bf16-GB/s-min  bf16->fp8e5-rtne-GB/s-min  fp8e5->bf16-GB/s-max  bf16->fp8e5-rtne-GB/s-max  fp8e5->bf16-CV  bf16->fp8e5-rtne-CV
0      4096.0          0.010617               0.020971              0.010614                   0.020965              0.010618                   0.020975        0.000192             0.000310
1     16384.0          0.042465               0.083883              0.042451                   0.083859              0.042476                   0.083900        0.000380             0.000310
2     65536.0          0.169765               0.335034              0.169733                   0.334914              0.169818                   0.335188        0.000293             0.000341
3    262144.0          0.508051               1.238690              0.507874                   1.218866              0.508189                   1.260308        0.000259             0.016453
4   1048576.0          1.166588               2.303297              1.165706                   2.303064              1.167035                   2.303631        0.000671             0.000161
5   4194304.0          1.574403               4.553186              1.572210                   4.525967              1.575782                   4.597888        0.000999             0.027275
6  16777216.0          1.908402               5.553034              1.906225                   5.552781              1.910218                   5.553339        0.000954             0.000059
7  67108864.0          2.059672               5.925414              2.058876                   5.922217              2.060320                   5.930591        0.000308             0.002324
fp8e5-fp32:
            N  fp8e5->fp32-GB/s  fp32->fp8e5-rtne-GB/s  fp32->fp8e5-rtz-GB/s  fp8e5->fp32-GB/s-min  fp32->fp8e5-rtne-GB/s-min  fp32->fp8e5-rtz-GB/s-min  fp8e5->fp32-GB/s-max  fp32->fp8e5-rtne-GB/s-max  fp32->fp8e5-rtz-GB/s-max  fp8e5->fp32-CV  fp32->fp8e5-rtne-CV  fp32->fp8e5-rtz-CV
0      4096.0          0.045230               0.051425              0.346238              0.045214                   0.051406                  0.345479              0.045246                   0.051437                  0.347354        0.000944             0.000355            0.003939
1     16384.0          0.180759               0.205635              1.384368              0.180600                   0.205622                  1.383784              0.180919                   0.205643                  1.384719        0.000944             0.000178            0.001195
2     65536.0          0.720809               0.821871              5.509542              0.718344                   0.821583                  5.497986              0.722399                   0.822077                  5.520216        0.002619             0.000379            0.002502
3    262144.0          2.875647               3.222620             19.814360              2.874386                   3.213494                 19.775497              2.879438                   3.232515                 19.859394        0.000812             0.002902            0.003130
4   1048576.0          8.786459              10.241500             58.596032              8.780279                  10.234403                 58.383962              8.799141                  10.248006                 58.776682        0.001510             0.000658            0.002911
5   4194304.0         15.368254              18.575305             89.136205             14.555469                  18.125147                 88.234263             16.153809                  19.050470                 89.944760        0.040169             0.027046            0.017450
6  16777216.0         21.927561              25.998892            110.101176             21.904659                  25.994424                109.902104             21.952351                  26.003124                110.295157        0.001060             0.000165            0.001898
7  67108864.0         23.479745              27.945724            118.175422             23.471203                  27.936418                118.097838             23.491448                  27.958579                118.261008        0.000419             0.001028            0.000936
fp16-fp32:
            N  fp16->fp32-GB/s  fp32->fp16-rtne-GB/s  fp32->fp16-rtz-GB/s  fp16->fp32-GB/s-min  fp32->fp16-rtne-GB/s-min  fp32->fp16-rtz-GB/s-min  fp16->fp32-GB/s-max  fp32->fp16-rtne-GB/s-max  fp32->fp16-rtz-GB/s-max  fp16->fp32-CV  fp32->fp16-rtne-CV  fp32->fp16-rtz-CV
0      4096.0         0.151311              0.495884             0.640500             0.150943                  0.492308                 0.635236             0.151929                  0.499512                 0.644836       0.003135            0.011283           0.007796
1     16384.0         0.601910              1.978744             2.520615             0.600234                  1.967339                 2.497561             0.602707                  1.988349                 2.531520       0.002605            0.005164           0.006579
2     65536.0         2.398829              7.886402            10.105782             2.395322                  7.876923                10.039216             2.402346                  7.914976                10.176397       0.001565            0.002228           0.006945
3    262144.0         9.556836             30.840469            36.459528             9.547785                 30.624299                36.368480             9.581287                 30.942398                36.653244       0.001869            0.005031           0.004931
4   1048576.0        32.911989             66.788275            85.773093            32.572565                 66.534007                84.836246            33.098991                 67.216412                86.802651       0.007563            0.004307           0.009884
5   4194304.0        52.632751            111.535810           125.427760            51.940559                108.323965               124.356737            54.398006                116.095662               126.884793       0.020584            0.029834           0.009612
6  16777216.0        83.593503            183.247404           166.796403            83.332753                182.726495               165.182107            83.919648                183.590298               168.014093       0.002849            0.002184           0.008465
7  67108864.0        93.211930            206.396724           183.397635            93.128115                205.986829               183.197372            93.279308                206.789128               183.574224       0.000715            0.001967           0.001007
bf16-fp32:
            N  bf16->fp32-GB/s  fp32->bf16-rtne-GB/s  fp32->bf16-rtz-GB/s  bf16->fp32-GB/s-min  fp32->bf16-rtne-GB/s-min  fp32->bf16-rtz-GB/s-min  bf16->fp32-GB/s-max  fp32->bf16-rtne-GB/s-max  fp32->bf16-rtz-GB/s-max  bf16->fp32-CV  fp32->bf16-rtne-CV  fp32->bf16-rtz-CV
0      4096.0         0.216262              0.496485             0.140877             0.215852                  0.494686                 0.140659             0.217133                  0.497570                 0.141047       0.003744            0.003428           0.001148
1     16384.0         0.858250              1.979939             0.562637             0.856904                  1.976834                 0.561866             0.860504                  1.988349                 0.563411       0.002169            0.003098           0.001038
2     65536.0         3.427615              7.867467             2.246307             3.413333                  7.839234                 2.243769             3.442017                  7.884504                 2.248079       0.003163            0.003395           0.001256
3    262144.0        13.554499             30.464150             8.862205            13.518151                 30.340742                 8.853823            13.596681                 30.538676                 8.868200       0.004282            0.004607           0.001022
4   1048576.0        27.018191             63.879128            18.462470            26.969546                 63.689016                18.440067            27.092187                 64.125246                18.476459       0.002321            0.004308           0.001328
5   4194304.0        41.347635            109.755436            29.842077            41.139987                103.226614                28.489266            41.689568                116.095650                30.753634       0.005780            0.056834           0.043698
6  16777216.0        66.565688            172.773959            46.072265            66.323594                172.434798                46.043690            66.741518                173.189538                46.121663       0.002899            0.002157           0.000737
7  67108864.0        73.298160            193.933828            49.872075            73.071499                193.687548                49.631154            73.524945                194.266189                50.059130       0.002932            0.001379           0.003740

@AndreyPavlenko
Copy link
Contributor Author

The benchmarks failed with OOME, but I see the same issue on the main branch https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/17038905956/job/48297736887 .

@whitneywhtsang
Copy link
Contributor

The benchmarks failed with OOME, but I see the same issue on the main branch https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/17038905956/job/48297736887 .

It should be fixed by #4913.

@AndreyPavlenko AndreyPavlenko force-pushed the AndreyPavlenko/bench-fp8 branch from c1d4de6 to a1688f8 Compare August 19, 2025 11:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[benchmarks] Add fp8 types to the conversion benchmark
3 participants